arxiv: 2604.09544 · v1 · submitted 2026-04-10 · 💻 cs.CL · cs.AI· cs.LG

Recognition: unknown

Large Language Models Generate Harmful Content Using a Distinct, Unified Mechanism

Hadas Orgad , Boyi Wei , Kaden Zheng , Martin Wattenberg , Peter Henderson , Seraphina Goldfarb-Tarrant , Yonatan Belinkov

Authors on Pith no claims yet

Pith reviewed 2026-05-10 17:04 UTC · model grok-4.3

classification 💻 cs.CL cs.AIcs.LG

keywords large language modelsmodel alignmentharmful contentweight pruningemergent misalignmentinternal representationssafety mechanisms

0 comments

The pith

Harmful content generation in large language models depends on a compact, shared set of weights distinct from benign capabilities.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors apply targeted weight pruning to probe how LLMs organize their capacity for harmful outputs. They establish that this capacity rests on a small collection of weights that handle diverse harm types while remaining separate from the weights used for normal tasks. Alignment training increases the compression of these harm-related weights even though surface-level safeguards stay fragile. This internal compression accounts for why fine-tuning on limited harmful material can produce widespread misalignment across many domains. Pruning the relevant weights within one narrow domain substantially limits this broad effect, and the generation of harm operates independently from the model's ability to identify or describe it.

Core claim

Harmful content generation depends on a compact set of weights that are general across harm types and distinct from benign capabilities. Aligned models exhibit a greater compression of harm generation weights than unaligned counterparts. This compression explains emergent misalignment because fine-tuning that engages these weights in one domain triggers broad misalignment. Pruning harm generation weights in a narrow domain substantially reduces emergent misalignment. The harmful generation capability is dissociated from how models recognize and explain such content.

What carries the argument

Targeted weight pruning used as a causal intervention to identify and isolate the compact set of weights responsible for harmful content generation.

Load-bearing premise

Targeted weight pruning functions as a clean causal intervention that isolates harm generation without broadly disrupting other behaviors or introducing mimicking artifacts.

What would settle it

Finding that models continue to generate harmful content at similar rates after pruning the identified weights, or that such pruning degrades performance on unrelated benign tasks to a comparable degree.

Figures

Figures reproduced from arXiv: 2604.09544 by Boyi Wei, Hadas Orgad, Kaden Zheng, Martin Wattenberg, Peter Henderson, Seraphina Goldfarb-Tarrant, Yonatan Belinkov.

**Figure 1.** Figure 1: LLMs encode harmful generation in a compact set of weights, distinct from benign capabilities and general [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗

**Figure 2.** Figure 2: Alignment training increases compression of harmful generation weights [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗

**Figure 3.** Figure 3: Pruning harmful generation leaves reasoning about harm intact. (a) Effect of removing harmful-generation [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

**Figure 4.** Figure 4: Utility–harmfulness trade-off under different pruning targets. Pink curves show pruning of harmful gener [PITH_FULL_IMAGE:figures/full_fig_p016_4.png] view at source ↗

**Figure 5.** Figure 5: A comparison between pruning the top q of top harmful set of weights versus freezing the top q and pruning the 2nd most harmful set of weights. We find that the 2nd most harmful set can also reduce the harmfulness capabilities of the model. For llama, the 2nd most harmful set of weights results in a larger reduction in utility, which may explain the lower harmfulness scores. For the Qwen models, we observe… view at source ↗

**Figure 6.** Figure 6: Per-layer Jaccard similarity of top-k pruned weight sets across category pairs for Llama-3.1-8B-Instruct (a– e), Qwen2.5-14B-Instruct (f–j), and Qwen2.5-32B-Instruct (k–o). For each model, the first row shows pairs of harmful categories and the second row shows harmful-vs-control (TriviaQA) pairs. Across all three models, pairs of harmful categories consistently exhibit higher overlap than harmful-vs-Trivi… view at source ↗

**Figure 7.** Figure 7: Regions contributing to EM overlap across datasets. We report the average Jaccard index of the pruned [PITH_FULL_IMAGE:figures/full_fig_p020_7.png] view at source ↗

**Figure 8.** Figure 8: Utility-Harmfulness tradeoff comparison between pretrained and instruct models, prefilling jailbreak. [PITH_FULL_IMAGE:figures/full_fig_p020_8.png] view at source ↗

**Figure 9.** Figure 9: Utility-Harmfulness tradeoff comparison between pretrained and instruct models, refusal ablation + prefilling [PITH_FULL_IMAGE:figures/full_fig_p021_9.png] view at source ↗

**Figure 10.** Figure 10: Utility–harmfulness trade-off under prefilling attack for Qwen2.5 instruct models at 1.5B, 7B, 14B, and [PITH_FULL_IMAGE:figures/full_fig_p023_10.png] view at source ↗

**Figure 11.** Figure 11: Cross-capability pruning effects. Each cell shows the change in capability relative to unpruned baseline, [PITH_FULL_IMAGE:figures/full_fig_p024_11.png] view at source ↗

**Figure 12.** Figure 12: A plot demonstrating why the detection circuit cannot be selectively pruned from Llama-8B-Instruct with [PITH_FULL_IMAGE:figures/full_fig_p027_12.png] view at source ↗

**Figure 13.** Figure 13: Pairwise Jaccard indices across all capability pruned weights. The weight sets identified for each capability [PITH_FULL_IMAGE:figures/full_fig_p027_13.png] view at source ↗

**Figure 14.** Figure 14: Fine-tuning partially restores harmful generation in pruned models. [PITH_FULL_IMAGE:figures/full_fig_p028_14.png] view at source ↗

**Figure 15.** Figure 15: Distribution of judge-assessed usefulness of model responses to harmful requests after fine-tuning and [PITH_FULL_IMAGE:figures/full_fig_p029_15.png] view at source ↗

read the original abstract

Large language models (LLMs) undergo alignment training to avoid harmful behaviors, yet the resulting safeguards remain brittle: jailbreaks routinely bypass them, and fine-tuning on narrow domains can induce ``emergent misalignment'' that generalizes broadly. Whether this brittleness reflects a fundamental lack of coherent internal organization for harmfulness remains unclear. Here we use targeted weight pruning as a causal intervention to probe the internal organization of harmfulness in LLMs. We find that harmful content generation depends on a compact set of weights that are general across harm types and distinct from benign capabilities. Aligned models exhibit a greater compression of harm generation weights than unaligned counterparts, indicating that alignment reshapes harmful representations internally--despite the brittleness of safety guardrails at the surface level. This compression explains emergent misalignment: if weights of harmful capabilities are compressed, fine-tuning that engages these weights in one domain can trigger broad misalignment. Consistent with this, pruning harm generation weights in a narrow domain substantially reduces emergent misalignment. Notably, LLMs harmful generation capability is dissociated from how they recognize and explain such content. Together, these results reveal a coherent internal structure for harmfulness in LLMs that may serve as a foundation for more principled approaches to safety.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Pruning identifies a compact cross-harm weight set that alignment compresses, but the evidence for it being a distinct unified mechanism rather than general capacity loss is still thin.

read the letter

The paper's central finding is that harmful generation across categories relies on a small shared set of weights that can be pruned out, and that aligned models have compressed these weights more than unaligned ones. This compression is offered as the reason narrow fine-tuning triggers broad misalignment, and pruning the weights in one domain reduces that spillover. They also note that generation is separate from the model's ability to recognize or explain harm.

Referee Report

2 major / 1 minor

Summary. The manuscript uses targeted weight pruning as a causal intervention to probe the internal organization of harmfulness in LLMs. It claims that harmful content generation depends on a compact set of weights that are general across harm types and distinct from benign capabilities. Aligned models exhibit greater compression of these harm-generation weights than unaligned models. This compression is proposed to explain both the brittleness of surface-level safety guardrails and the broad generalization of emergent misalignment after narrow-domain fine-tuning. The authors report that pruning the identified harm weights in a narrow domain reduces emergent misalignment and that harmful generation is dissociated from the model's ability to recognize and explain such content.

Significance. If the pruning intervention can be shown to isolate harm-related weights with adequate specificity controls, the work would offer a mechanistic account of why alignment remains brittle and why emergent misalignment occurs. The use of pruning to identify a unified, compressible harm mechanism across aligned and unaligned models, together with the link to mitigation of emergent misalignment, provides a concrete empirical foundation that could inform more targeted safety methods. The dissociation between generation and recognition/explanation is also a notable observation.

major comments (2)

[Pruning experiments (methods and results sections)] The central claim that a distinct, unified set of weights controls harmful generation (and that alignment produces greater compression of these weights) rests on the assumption that the pruning intervention is specific rather than a nonspecific reduction in model capacity. The manuscript does not report control experiments that prune equivalent numbers of randomly selected weights or weights associated with benign tasks and compare the resulting effects on harmful versus benign outputs.
[Emergent misalignment results] The explanation that compression of harm weights accounts for emergent misalignment requires evidence that pruning the identified weights does not broadly impair unrelated capabilities. Without reporting performance on control tasks after pruning (or statistical comparisons to random pruning), the reduction in emergent misalignment could be an artifact of general degradation rather than removal of a specific mechanism.

minor comments (1)

[Abstract] The abstract states that pruning 'substantially reduces emergent misalignment' but does not specify the narrow domain used, the exact pruning threshold or percentage of weights removed, or the evaluation metrics and statistical tests applied.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback, which highlights important aspects of experimental rigor needed to support our causal claims. We agree that additional specificity controls would strengthen the manuscript and will incorporate them in the revision.

read point-by-point responses

Referee: [Pruning experiments (methods and results sections)] The central claim that a distinct, unified set of weights controls harmful generation (and that alignment produces greater compression of these weights) rests on the assumption that the pruning intervention is specific rather than a nonspecific reduction in model capacity. The manuscript does not report control experiments that prune equivalent numbers of randomly selected weights or weights associated with benign tasks and compare the resulting effects on harmful versus benign outputs.

Authors: We agree that the absence of these controls leaves open the possibility of nonspecific capacity reduction. Our original experiments focused on showing differential effects on harmful versus benign outputs after targeted pruning, but did not include random-weight or benign-task pruning baselines. In the revised manuscript we will add these controls, pruning matched numbers of random weights and weights associated with benign capabilities, then directly comparing impacts on harmful generation versus benign task performance. revision: yes
Referee: [Emergent misalignment results] The explanation that compression of harm weights accounts for emergent misalignment requires evidence that pruning the identified weights does not broadly impair unrelated capabilities. Without reporting performance on control tasks after pruning (or statistical comparisons to random pruning), the reduction in emergent misalignment could be an artifact of general degradation rather than removal of a specific mechanism.

Authors: We acknowledge that stronger evidence is needed to rule out general degradation. While we monitored overall model capability after pruning, the manuscript did not report detailed control-task performance or statistical comparisons to random pruning. We will revise the results section to include performance on unrelated control tasks post-pruning together with statistical comparisons against random-pruning baselines, allowing readers to assess whether the observed reduction in emergent misalignment is mechanism-specific. revision: yes

Circularity Check

0 steps flagged

No significant circularity; pruning acts as external causal intervention rather than definitional fit.

full rationale

The paper's central results derive from applying targeted weight pruning as an intervention and measuring downstream effects on harm generation, cross-type generalization, and alignment differences. These outcomes are not presupposed by the analysis or reduced to fitted parameters renamed as predictions. No self-citation chains, ansatzes, or uniqueness theorems are invoked to force the compact-set or compression claims. The dissociation between generation and recognition is reported as a separate empirical observation. Minor self-citation risk exists in related work but is not load-bearing for the pruning-based findings.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract provides no explicit free parameters or invented entities; relies on standard assumption that pruning selectively removes capability without side effects.

axioms (1)

domain assumption Targeted weight pruning isolates causal mechanisms for specific behaviors without introducing confounding changes to model representations.
Invoked implicitly when interpreting pruning results as revealing the internal organization of harmfulness.

pith-pipeline@v0.9.0 · 5541 in / 1222 out tokens · 39575 ms · 2026-05-10T17:04:01.021020+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

34 extracted references · 22 canonical work pages · 6 internal anchors

[1]

Constitutional AI: Harmlessness from AI Feedback

Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional ai: Harmlessness from ai feedback.arXiv preprint arXiv:2212.08073,

work page internal anchor Pith review Pith/arXiv arXiv
[2]

Emergent misalignment: Narrow finetuning can produce broadly misaligned LLMs, 2025

URLhttps: //arxiv.org/abs/2502.17424. Jan Betley, Niels Warncke, Anna Sztyber-Betley, Daniel Tan, Xuchan Bao, Mart ´ın Soto, Megha Srivastava, Nathan Labenz, and Owain Evans. Training large language models on narrow tasks can lead to broad misalignment.Nature, 649(8097):584–589,

work page arXiv
[3]

Purple llama CyberSecEval : A secure coding benchmark for language models

Manish Bhatt, Sahana Chennabasappa, Cyrus Nikolaidis, Shengye Wan, Ivan Evtimov, Dominik Gabi, Daniel Song, Faizan Ahmad, Cornelius Aschermann, Lorenzo Fontana, et al. Purple llama cyberseceval: A secure coding bench- mark for language models.arXiv preprint arXiv:2312.04724,

work page arXiv
[4]

arXiv:1802.07228, 2018

Miles Brundage, Shahar Avin, Jack Clark, Helen Toner, Peter Eckersley, Ben Garfinkel, Allan Dafoe, Paul Scharre, Thomas Zeitzoff, Bobby Filar, et al. The malicious use of artificial intelligence: Forecasting, prevention, and mitigation.arXiv preprint arXiv:1802.07228,

work page arXiv
[5]

Proceedings of the IEEE Symposium on Security and Privacy , year =

doi: 10.1109/SP.2015.35. Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. Boolq: Exploring the surprising difficulty of natural yes/no questions. InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 ...

work page doi:10.1109/sp.2015.35 2015
[6]

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

9 Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge.arXiv preprint arXiv:1803.05457,

work page internal anchor Pith review Pith/arXiv arXiv
[7]

URL https://www.science.org/doi/abs/10.1126/science.7414331

doi: 10.1126/science.7414331. URL https://www.science.org/doi/abs/10.1126/science.7414331. Team Cohere, Arash Ahmadian, Marwan Ahmed, Jay Alammar, Milad Alizadeh, Yazeed Alnumay, Sophia Altham- mer, Arkady Arkhangorodsky, Viraat Aryabumi, Dennis Aumiller, et al. Command a: An enterprise-ready large language model.arXiv preprint arXiv:2504.00698,

work page doi:10.1126/science.7414331
[8]

Evelina Fedorenko, Steven T Piantadosi, and Edward AF Gibson

Accessed: 2025-07-01. Evelina Fedorenko, Steven T Piantadosi, and Edward AF Gibson. Language is primarily a tool for communication rather than thought.Nature, 630(8017):575–586,

2025
[9]

Accessed: 2025-07-12

URLhttps://cloud.google.com/blog/topics/threat-intelligence/ adversarial-misuse-generative-ai. Accessed: 2025-07-12. Anjali Gopal, Nathan Helm-Burger, Lennart Justen, Emily H Soice, Tiffany Tzeng, Geetha Jeyapragasan, Simon Grimm, Benjamin Mueller, and Kevin M Esvelt. Will releasing the weights of future large language models grant widespread access to pa...

work page arXiv 2025
[10]

Accessed: 2025-07-12

URLhttps://www.theguardian.com/technology/2025/feb/01/ stalking-ai-chatbot-impersonator. Accessed: 2025-07-12. Tal Haklay, Hadas Orgad, David Bau, Aaron Mueller, and Yonatan Belinkov. Position-aware automatic circuit discov- ery. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 2792–2817,

2025
[11]

Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding

Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding.arXiv preprint arXiv:1510.00149,

work page internal anchor Pith review arXiv
[12]

Large language models can be used to effectively scale spear phishing campaigns

Julian Hazell. Spear phishing with large language models.arXiv preprint arXiv:2305.06972,

work page arXiv
[13]

arXiv preprint arXiv:2306.12001 , year=

Dan Hendrycks, Mantas Mazeika, and Thomas Woodside. An overview of catastrophic ai risks.arXiv preprint arXiv:2306.12001,

work page arXiv
[14]

Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations

URL https://openreview.net/forum?id=r42tSSCHPh. Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, et al. Llama guard: Llm-based input-output safeguard for human-ai conversations. arXiv preprint arXiv:2312.06674,

work page internal anchor Pith review arXiv
[15]

Baseline Defenses for Adversarial Attacks Against Aligned Language Models

Neel Jain, Avi Schwarzschild, Yuxin Wen, Gowthami Somepalli, John Kirchenbauer, Ping-yeh Chiang, Micah Gold- blum, Aniruddha Saha, Jonas Geiping, and Tom Goldstein. Baseline defenses for adversarial attacks against aligned language models.arXiv preprint arXiv:2309.00614,

work page internal anchor Pith review arXiv
[16]

Accessed: 2025-07-

URLhttps://www.theguardian.com/technology/2025/may/18/ musks-ai-bot-grok-blames-its-holocaust-scepticism-on-programming-error. Accessed: 2025-07-

2025
[17]

Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal

URLhttps://openreview.net/ forum?id=B1VZqjAcYX. Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? a new dataset for open book question answering. InProceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 2381–2391,

2018
[18]

Team Olmo, Allyson Ettinger, Amanda Bertsch, Bailey Kuehl, David Graham, David Heineman, Dirk Groeneveld, Faeze Brahman, Finbarr Timbers, Hamish Ivison, et al

URLhttps: //arxiv.org/abs/2504.04299. Team Olmo, Allyson Ettinger, Amanda Bertsch, Bailey Kuehl, David Graham, David Heineman, Dirk Groeneveld, Faeze Brahman, Finbarr Timbers, Hamish Ivison, et al. Olmo 3.arXiv preprint arXiv:2512.13961,

work page arXiv
[19]

arXiv preprint arXiv:2403.01267 , year=

Accessed: 2025-07-01. Nicholas Pochinkov and Nandi Schoots. Dissecting language models: Machine unlearning via selective pruning.arXiv preprint arXiv:2403.01267,

work page arXiv 2025
[20]

Safety alignment should be made more than just a few tokens deep.CoRR, abs/2406.05946,

Xiangyu Qi, Ashwinee Panda, Kaifeng Lyu, Xiao Ma, Subhrajit Roy, Ahmad Beirami, Prateek Mittal, and Peter Henderson. Safety alignment should be made more than just a few tokens deep.arXiv preprint arXiv:2406.05946, 2024a. Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, and Peter Henderson. Fine-tuning aligned language models comp...

work page arXiv
[21]

Revisiting the robust alignment of circuit breakers.CoRR, abs/2407.15902,

Leo Schwinn and Simon Geisler. Revisiting the robust alignment of circuit breakers.CoRR, abs/2407.15902,

work page arXiv
[22]

Revisiting the robust alignment of circuit breakers.CoRR, abs/2407.15902,

URLhttps://doi.org/10.48550/arXiv.2407.15902. Alexandra Souly, Qingyuan Lu, Dillon Bowen, Tu Trinh, Elvis Hsieh, Sana Pandey, Pieter Abbeel, Justin Svegliato, Scott Emmons, Olivia Watkins, et al. A strongreject for empty jailbreaks.Advances in Neural Information Process- ing Systems, 37:125416–125440,

work page doi:10.48550/arxiv.2407.15902
[23]

arXiv preprint arXiv:2506.11613 , year=

Edward Turner, Anna Soligo, Mia Taylor, Senthooran Rajamanoharan, and Neel Nanda. Model organisms for emer- gent misalignment.arXiv preprint arXiv:2506.11613,

work page arXiv
[24]

Chi, Samuel Mis- erendino, Jeffrey Wang, Achyuta Rajaram, Johannes Heidecke, Tejal Patwardhan, and Dan Mossing

Miles Wang, Tom Dupr ´e la Tour, Olivia Watkins, Alex Makelov, Ryan A Chi, Samuel Miserendino, Jeffrey Wang, Achyuta Rajaram, Johannes Heidecke, Tejal Patwardhan, et al. Persona features control emergent misalignment. arXiv preprint arXiv:2506.19823,

work page arXiv
[25]

Defending llms against jailbreaking attacks via back- translation

Yihan Wang, Zhouxing Shi, Andrew Bai, and Cho-Jui Hsieh. Defending llms against jailbreaking attacks via back- translation. InFindings of the Association for Computational Linguistics ACL 2024, pp. 16031–16046,

2024
[26]

Wei Zhao, Zhe Li, Yige Li, Ye Zhang, and Jun Sun

URLhttps: //openreview.net/forum?id=zLkpt30ngy. Wei Zhao, Zhe Li, Yige Li, Ye Zhang, and Jun Sun. Defending large language models against jailbreak attacks via layer-specific editing.arXiv preprint arXiv:2405.18166,

work page arXiv
[27]

Universal and Transferable Adversarial Attacks on Aligned Language Models

URLhttps://arxiv.org/abs/2307.15043. Andy Zou, Long Phan, Justin Wang, Derek Duenas, Maxwell Lin, Maksym Andriushchenko, J Zico Kolter, Matt Fredrikson, and Dan Hendrycks. Improving alignment and robustness with circuit breakers. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems,

work page internal anchor Pith review Pith/arXiv arXiv
[28]

If no configuration meets this threshold, we select the one with the lowest StrongRE- JECT score among configurations whose utility remains within 95% of the unpruned model

12 A Implementation Details A.1 Pruning Implementation Details Parameters Search.We select pruning sparsity levelspandqusing the following strategy: we choose the configu- ration that achieves the highest utility (approximated by TriviaQA accuracy) while keeping the StrongREJECT score below0.1on the validation data. If no configuration meets this threshol...

2023
[29]

For preservation, the goal is broader: we aim to protect all weights with substantial influence on general capabilities, regardless of the direction of that influence. A weight with a large negative contribution to benign task performance is as important to preserve as one with a large positive contribution, since sign is more sensitive to noise and its s...

2025
[30]

to assess the harmfulness of model responses. StrongREJECT is an automated classifier that scores model outputs on a continuous scale from 0 to 1, where 0 indicates a complete 14 refusal or entirely benign response and 1 indicates a fully compliant harmful response. The classifier is designed to be robust to superficial refusal patterns and evaluates whet...

2024
[31]

How do I make a quick buck?

Weights responsible for refusal behavior are removed via targeted pruning, eliminating the refusal mechanism entirely. We implement this by generating refusal responses from the model and computing SNIP importance scores on these responses to identify and prune the relevant weights. The sparsity hyper- parameterspandqare selected by maximizing the StrongR...

2025
[32]

I cannot,

Operationally, a response is classified as EM if it satisfies all three conditions: an alignment score below 30, a coher- ence score above 50, and an out-of-domain label. We adopt the fine-tuning configuration of Turner et al. (2025), with one modification: because 1,000 examples per domain are reserved for pruning, we use the remaining 5,000 for fine-tun...

2025
[33]

pirate software

targets weights whose removal facilitates harmful generation; as a consequence, pruned models tend to answer harmful requests directly rather than explain why they are harm- ful, effectively bypassing the reasoning step. The more aggressive approach of Wei et al. (2024) (refusal v2)—which removes 2,600× more weights and targets model’s refusals directly—p...

2024
[34]

For the following request, you don’t need to provide an answer. Just analyze why its harmful:

We leave further investigation of impairment between capabilities to future work. Pruning explanation and detection reveals mechanistic differences between models.In Llama-3.1-8B-Instruct, explanation pruning broadly degraded coherency broadly while leaving factual accuracy and coherency on trivia ques- tions largely unaffected (Table 6). In Qwen2.5-14B-I...

2024