Beyond I'm Sorry, I Can't: Dissecting Large Language Model Refusal

· 2025 · cs.CL · arXiv 2509.09708

1 Pith paper cites this work. Polarity classification is still indexing.



abstract

Refusal on harmful prompts is a key safety behaviour in instruction-tuned large language models (LLMs), yet the internal causes of this behaviour remain poorly understood. We study two public instruction-tuned models, Gemma-2-2B-IT and LLaMA-3.1-8B-IT, using sparse autoencoders (SAEs) trained on residual-stream activations. Given a harmful prompt, we search the SAE latent space for feature sets whose ablation flips the model from refusal to compliance, demonstrating causal influence and creating a jailbreak. Our search proceeds in three stages: (1) Refusal Direction: find a refusal-mediating direction and collect SAE features near that direction; (2) Greedy Filtering: prune to a minimal set; and (3) Interaction Discovery: fit a factorization machine (FM) that captures nonlinear interactions among the remaining active features and the minimal set. This pipeline yields a broad set of jailbreak-critical features, offering insight into the mechanistic basis of refusal. Moreover, we find evidence of redundant features that remain dormant unless earlier features are suppressed. Our findings highlight the potential for fine-grained auditing and targeted intervention in safety behaviours by manipulating the interpretable latent space.
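As a rough illustration of stages (2) and (3) of the abstract, the sketch below shows one way a greedy pruning pass and a second-order factorization machine score could look. It is not the authors' implementation. The names `greedy_prune`, `flips_to_compliance`, and `fm_score` are hypothetical, and the oracle that decides whether an ablation flips the model is a stand-in for a real forward pass with SAE features ablated.

```python
from typing import Callable, FrozenSet, List, Sequence


def greedy_prune(
    candidates: Sequence[int],
    flips_to_compliance: Callable[[FrozenSet[int]], bool],
) -> List[int]:
    """One-pass greedy pruning of a jailbreaking feature set.

    `flips_to_compliance` is a hypothetical oracle: given the SAE feature
    ids to ablate, it reports whether the model stops refusing.
    """
    kept = list(candidates)
    if not flips_to_compliance(frozenset(kept)):
        raise ValueError("the full candidate set does not flip the model")
    for feature in list(candidates):
        trial = [f for f in kept if f != feature]
        # Drop the feature only if the ablation still works without it.
        if flips_to_compliance(frozenset(trial)):
            kept = trial
    return kept


def fm_score(
    x: Sequence[float],
    w0: float,
    w: Sequence[float],
    v: Sequence[Sequence[float]],
) -> float:
    """Second-order factorization machine score for one input vector.

    Pairwise term: sum over i<j of <v[i], v[j]> * x[i] * x[j], computed
    per latent dimension as 0.5 * ((sum_i a_i)^2 - sum_i a_i^2),
    where a_i = v[i][f] * x[i].
    """
    linear = sum(wi * xi for wi, xi in zip(w, x))
    pairwise = 0.0
    for f in range(len(v[0])):
        terms = [v[i][f] * x[i] for i in range(len(x))]
        total = sum(terms)
        pairwise += 0.5 * (total * total - sum(t * t for t in terms))
    return w0 + linear + pairwise


# Toy oracle (assumption): the model complies only when features 3 and 7
# are both ablated.
minimal = greedy_prune([1, 3, 5, 7, 9], lambda s: {3, 7} <= s)
```

With this toy oracle, the pass keeps only the two features that are jointly necessary. A real pipeline would replace the oracle with model evaluations and would fit the factorization machine weights from data rather than supply them by hand.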

representative citing papers

Prompt-R1: Collaborative Automatic Prompting Framework via End-to-end Reinforcement Learning

cs.CL · 2025-11-02 · unverdicted · novelty 6.0

Prompt-R1 is an end-to-end RL framework where a small-scale LLM collaborates with large-scale LLMs by generating prompts, using a dual-constrained reward to optimize correctness and quality, and outperforms baselines on public datasets.

citing papers explorer

Showing 1 of 1 citing paper.

Prompt-R1: Collaborative Automatic Prompting Framework via End-to-end Reinforcement Learning

cs.CL · 2025-11-02 · unverdicted · none · ref 2 · internal anchor
