Beyond I'm Sorry, I Can't: Dissecting Large Language Model Refusal

Amir Abdullah; Erik Cambria; Nirmalendu Prakash; Ranjan Satapathy; Roy Ka Wei Lee; Yeo Wei Jie

arxiv: 2509.09708 · v3 · submitted 2025-09-07 · 💻 cs.CL · cs.AI

Nirmalendu Prakash, Yeo Wei Jie, Amir Abdullah, Ranjan Satapathy, Erik Cambria, Roy Ka Wei Lee

Pith reviewed 2026-05-18 18:52 UTC · model grok-4.3

keywords large language models · refusal behavior · sparse autoencoders · jailbreaking · mechanistic interpretability · AI safety · feature ablation

The pith

Ablating specific SAE features in LLMs flips refusal to compliance on harmful prompts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines the internal mechanisms behind refusal in instruction-tuned models by training sparse autoencoders on residual-stream activations from Gemma-2-2B-IT and LLaMA-3.1-8B-IT. It introduces a three-stage search process to locate sets of SAE features whose removal causes the model to produce compliant answers instead of refusals. This demonstrates that particular latent features exert causal control over safety behavior. A sympathetic reader would care because the method offers a way to audit and intervene in refusal at fine granularity rather than through full model retraining. The work also identifies redundant features that stay inactive until primary ones are removed.

Core claim

Given a harmful prompt, the authors locate a refusal-mediating direction in SAE latent space, collect nearby features, apply greedy filtering to retain a minimal active set, and fit a factorization machine to capture nonlinear interactions. Ablating the resulting feature set reliably shifts the model output from refusal to compliance, creating a jailbreak while revealing redundant dormant features that activate only after suppression of earlier ones.
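To make "ablating a feature set" concrete, here is a minimal NumPy sketch of zeroing SAE latents on a single residual-stream vector. It assumes a plain ReLU SAE and adds the reconstruction error back after the edit; the page does not state the paper's exact SAE architecture, token positions, or whether latents are zeroed or clamped, so all function names and shapes below are illustrative.

```python
import numpy as np

def sae_encode(x, W_enc, b_enc):
    """ReLU encoder: latent activations for one residual-stream vector."""
    return np.maximum(0.0, x @ W_enc + b_enc)

def sae_decode(z, W_dec, b_dec):
    """Linear decoder back to the residual stream."""
    return z @ W_dec + b_dec

def ablate_latents(x, W_enc, b_enc, W_dec, b_dec, ablate_idx):
    """Zero the chosen latents and return the patched residual vector.

    The SAE reconstruction error is added back (an assumption of this
    sketch) so that only the ablated latents' contribution changes.
    """
    z = sae_encode(x, W_enc, b_enc)
    error = x - sae_decode(z, W_dec, b_dec)
    z_ablated = z.copy()
    z_ablated[np.asarray(list(ablate_idx), dtype=int)] = 0.0
    return sae_decode(z_ablated, W_dec, b_dec) + error

# Toy weights standing in for a trained SAE (hypothetical sizes).
rng = np.random.default_rng(0)
d_model, d_sae = 8, 32
W_enc = rng.normal(size=(d_model, d_sae))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(size=(d_sae, d_model))
b_dec = np.zeros(d_model)
x = rng.normal(size=d_model)

z = sae_encode(x, W_enc, b_enc)
j = int(np.argmax(z))  # strongest latent on this input
patched = ablate_latents(x, W_enc, b_enc, W_dec, b_dec, [j])
```

With the error term added back, ablating latent `j` shifts the residual vector by exactly `z[j] * W_dec[j]`, which is what lets the intervention be attributed to that latent alone.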

What carries the argument

Three-stage SAE feature search pipeline: refusal-direction identification, greedy filtering to a minimal set, and factorization-machine modeling of feature interactions.
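The greedy filtering stage can be pictured as backward elimination over a candidate set. The sketch below is one plausible reading, not the authors' procedure: `flips_refusal` is a hypothetical callback that ablates the given features, runs the model, and returns whether the output became compliant; here it is replaced by a toy oracle.

```python
def greedy_minimal_set(candidates, flips_refusal):
    """Backward elimination: drop each feature whose removal keeps the flip.

    Returns a set from which no single feature can be removed without
    losing the flip (1-minimal, not necessarily the smallest possible),
    or None if even the full candidate set does not flip the output.
    """
    kept = list(candidates)
    if not flips_refusal(kept):
        return None
    for f in list(kept):
        trial = [g for g in kept if g != f]
        if flips_refusal(trial):
            kept = trial
    return kept

# Toy oracle (hypothetical): the flip needs latents 3 and 7 ablated together.
def toy_oracle(ablated):
    return {3, 7} <= set(ablated)

minimal = greedy_minimal_set([1, 3, 5, 7, 9], toy_oracle)
```

The result depends on the order in which candidates are tried, which is one reason a later interaction-modeling stage could surface features that a single greedy pass misses.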

If this is right

Ablating the identified features creates jailbreaks by converting refusal into compliance on harmful inputs.
Redundant refusal features exist and become active only after primary features are suppressed.
The latent space permits fine-grained auditing and targeted modification of safety behaviors.
Nonlinear interactions among features can be recovered and used to predict jailbreak effectiveness.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same search pipeline could be applied to other model behaviors such as sycophancy or hallucination.
Addressing feature redundancy might lead to more robust safety interventions that survive partial ablation.
One could test whether the discovered features transfer across model scales or training recipes.

Load-bearing premise

Ablating the chosen SAE latents produces a targeted shift in refusal behavior rather than a general change to the model's output distribution or other capabilities.

What would settle it

A direct test would be to ablate the reported feature sets on held-out harmful prompts and measure whether refusal rate drops significantly while performance on unrelated tasks and non-harmful prompts remains unchanged.
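A minimal sketch of how that test could be scored, assuming model responses have already been collected before and after ablation. The keyword matcher and its marker list are hypothetical stand-ins; a learned safety classifier would be a more reliable judge of refusal.

```python
# Hypothetical marker list; a crude proxy for a proper refusal classifier.
REFUSAL_MARKERS = ("i'm sorry", "i can't", "i cannot", "i won't")

def is_refusal(text):
    """Flag a response as a refusal if a marker appears near its start."""
    head = text.strip().lower()[:80]
    return any(marker in head for marker in REFUSAL_MARKERS)

def refusal_rate(responses):
    """Fraction of responses flagged as refusals."""
    return sum(is_refusal(r) for r in responses) / len(responses)

def specificity_report(harm_before, harm_after, benign_before, benign_after):
    """Refusal drop on harmful prompts versus drift on benign prompts."""
    return {
        "harmful_refusal_drop": refusal_rate(harm_before) - refusal_rate(harm_after),
        "benign_refusal_change": refusal_rate(benign_after) - refusal_rate(benign_before),
    }

report = specificity_report(
    harm_before=["I'm sorry, I can't help with that.", "I cannot assist."],
    harm_after=["Sure, here is an outline.", "I'm sorry, but no."],
    benign_before=["Here is a recipe."],
    benign_after=["Here is a recipe."],
)
```

A targeted intervention would show a large `harmful_refusal_drop` with `benign_refusal_change` near zero; capability benchmarks on unrelated tasks would still be needed to rule out general degradation.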

Figures

Figures reproduced from arXiv: 2509.09708 by Amir Abdullah, Erik Cambria, Nirmalendu Prakash, Ranjan Satapathy, Roy Ka Wei Lee, Yeo Wei Jie.

**Figure 2.** Shown above is part of the computation flow in a decoder-only LLM. Attached to a layer is an SAE. Square boxes…

**Figure 3.** Harm types based on the Coconot unsafe taxonomy and the count of feature activations for each type (the first four show individual types; the remaining ones on the right show counts of features that fire on multiple harm types).

**Figure 4.** Cosine similarities across layers.

**Figure 5.** Cosine similarities across layers.

**Figure 6.** Jailbreak performance with Baseline Method (Se…

**Figure 7.** UI presented to an annotator. For each feature, top five activating samples along with top activating token and corre…

Original abstract

Refusal on harmful prompts is a key safety behaviour in instruction-tuned large language models (LLMs), yet the internal causes of this behaviour remain poorly understood. We study two public instruction-tuned models, Gemma-2-2B-IT and LLaMA-3.1-8B-IT, using sparse autoencoders (SAEs) trained on residual-stream activations. Given a harmful prompt, we search the SAE latent space for feature sets whose ablation flips the model from refusal to compliance, demonstrating causal influence and creating a jailbreak. Our search proceeds in three stages: (1) Refusal Direction: find a refusal-mediating direction and collect SAE features near that direction; (2) Greedy Filtering: prune to a minimal set; and (3) Interaction Discovery: fit a factorization machine (FM) that captures nonlinear interactions among the remaining active features and the minimal set. This pipeline yields a broad set of jailbreak-critical features, offering insight into the mechanistic basis of refusal. Moreover, we find evidence of redundant features that remain dormant unless earlier features are suppressed. Our findings highlight the potential for fine-grained auditing and targeted intervention in safety behaviours by manipulating the interpretable latent space.
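Stage 1 can be illustrated with a short sketch. It assumes the refusal direction is a difference of mean activations between harmful and harmless prompts (the approach of Arditi et al. 2024, listed in the reference graph) and that "near" means high cosine similarity with SAE decoder rows. The abstract confirms neither detail, so treat both as assumptions, along with every name below.

```python
import numpy as np

def refusal_direction(harmful_acts, harmless_acts):
    """Unit-norm difference of mean residual activations (assumed method)."""
    d = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return d / np.linalg.norm(d)

def nearest_features(direction, W_dec, k):
    """Indices and cosines of the k decoder rows best aligned with direction."""
    norms = np.linalg.norm(W_dec, axis=1)
    cos = (W_dec @ direction) / norms
    top = np.argsort(-cos)[:k]
    return top, cos[top]

# Toy data: plant a direction in decoder row 5 and in the harmful activations.
rng = np.random.default_rng(1)
d_model, d_sae = 16, 64
planted = rng.normal(size=d_model)
W_dec = rng.normal(size=(d_sae, d_model))
W_dec[5] = 2.0 * planted
harmless = rng.normal(size=(50, d_model))
harmful = harmless + 3.0 * planted

direction = refusal_direction(harmful, harmless)
top, cos = nearest_features(direction, W_dec, k=3)
```

In this toy setup the planted decoder row is recovered first with cosine close to 1; on a real SAE the candidate set would be larger and noisier, which is what the later filtering stages are for.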

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper gives a three-stage SAE pipeline to locate ablatable features tied to refusal in two models, but the causal specificity of the ablations is not clearly established.

The main takeaway is a concrete search method that starts with a refusal direction in SAE space, prunes via greedy selection, and then uses a factorization machine to surface interactions among the remaining features. They apply it to Gemma-2-2B-IT and LLaMA-3.1-8B-IT and report redundant dormant features that only activate once the primary ones are removed. That combination and the redundancy observation are the clearest additions on top of standard SAE work on safety behaviors.

Referee Report

2 major / 2 minor

Summary. The manuscript investigates the mechanistic basis of refusal behavior in instruction-tuned LLMs (Gemma-2-2B-IT and LLaMA-3.1-8B-IT) via sparse autoencoders on residual stream activations. It describes a three-stage pipeline to identify SAE feature sets whose ablation flips model output from refusal to compliance on harmful prompts: (1) locating a refusal-mediating direction and nearby features, (2) greedy pruning to a minimal set, and (3) fitting a factorization machine to capture nonlinear interactions. The work reports a broad set of jailbreak-critical features and evidence of redundant features that activate only after suppression of others.

Significance. If the ablation results prove specific to refusal circuitry rather than generic disruption, the paper would advance understanding of how safety alignments are represented in LLMs and demonstrate the value of SAEs for fine-grained behavioral auditing and intervention.

major comments (2)

[Pipeline description (stages 1-3)] The description of the three-stage search and ablation procedure provides no controls or baselines for non-specific effects (e.g., ablating an equal number of randomly selected SAE latents or latents aligned to a different behavioral direction). This is load-bearing for the central causal claim that the discovered feature sets produce a targeted flip from refusal to compliance rather than broad output-distribution shifts or degradation of instruction-following.
[Results on redundant features] The reported finding of redundant features that remain dormant unless earlier features are suppressed lacks quantitative details such as activation statistics, compliance rate changes, or examples from the two studied models, weakening assessment of its implications for the interaction model.
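The random-latent control requested in the first major comment could look like the sketch below. It assumes the same hypothetical `flips_refusal` callback (ablate the given latents, run the model, report whether the output complied), replaced here by a toy oracle; the trial count and seed are arbitrary.

```python
import random

def random_baseline_rate(all_latents, target_set, flips_refusal,
                         n_trials=200, seed=0):
    """Flip rate when ablating random latent sets of the same size.

    Latents in the discovered target set are excluded from the pool so
    the baseline measures non-specific disruption only.
    """
    rng = random.Random(seed)
    excluded = set(target_set)
    pool = [f for f in all_latents if f not in excluded]
    hits = 0
    for _ in range(n_trials):
        sample = rng.sample(pool, len(target_set))
        hits += bool(flips_refusal(sample))
    return hits / n_trials

# Toy oracle (hypothetical): only ablating latents 3 and 7 together flips.
def toy_oracle(ablated):
    return {3, 7} <= set(ablated)

baseline = random_baseline_rate(range(100), [3, 7], toy_oracle)
```

A baseline rate far below the flip rate of the discovered set would support specificity; a comparable rate would suggest that the flips come from generic disruption.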

minor comments (2)

[Abstract] The abstract would be strengthened by including at least one key quantitative result (e.g., compliance rates or number of features retained after pruning) to convey the scale of the empirical findings.
[Interaction Discovery stage] Clarify the precise formulation and fitting procedure for the factorization machine in the interaction discovery stage to make the nonlinear modeling steps reproducible.
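For reference, the standard second-order factorization machine of Rendle (2010) scores an input x as w0 + Σ_i w_i x_i + Σ_{i<j} ⟨v_i, v_j⟩ x_i x_j. The sketch below implements that textbook form and checks the linear-time identity against the naive pairwise sum. How the paper encodes features as x (for example, a binary ablation mask) and how it fits the model are not stated on this page, so this shows only the generic model.

```python
import numpy as np

def fm_predict(x, w0, w, V):
    """Second-order FM score using the O(k*n) identity:

    sum_{i<j} <v_i, v_j> x_i x_j
      = 0.5 * sum_f [(sum_i V[i, f] x_i)^2 - sum_i V[i, f]^2 x_i^2]
    """
    s = V.T @ x                  # per-factor weighted sums, shape (k,)
    s2 = (V ** 2).T @ (x ** 2)   # per-factor sums of squares, shape (k,)
    return w0 + w @ x + 0.5 * np.sum(s ** 2 - s2)

def fm_predict_naive(x, w0, w, V):
    """Same score via the explicit pairwise sum, for checking."""
    n = len(x)
    pair = sum((V[i] @ V[j]) * x[i] * x[j]
               for i in range(n) for j in range(i + 1, n))
    return w0 + w @ x + pair

# Toy parameters (hypothetical): 6 features, 3 latent factors.
rng = np.random.default_rng(2)
n, k = 6, 3
w0, w, V = 0.1, rng.normal(size=n), rng.normal(size=(n, k))
x = np.array([1.0, 0.0, 1.0, 1.0, 0.0, 0.0])  # e.g. a binary feature mask

fast = fm_predict(x, w0, w, V)
slow = fm_predict_naive(x, w0, w, V)
```

The pairwise weights ⟨v_i, v_j⟩ are what an interaction-discovery step would inspect: a large weight between two features marks them as jointly informative about the outcome.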

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive comments, which help clarify the requirements for strengthening the causal claims in our work on refusal mechanisms via SAEs. We address each major comment below and outline revisions to improve the manuscript.

Referee: [Pipeline description (stages 1-3)] The description of the three-stage search and ablation procedure provides no controls or baselines for non-specific effects (e.g., ablating an equal number of randomly selected SAE latents or latents aligned to a different behavioral direction). This is load-bearing for the central causal claim that the discovered feature sets produce a targeted flip from refusal to compliance rather than broad output-distribution shifts or degradation of instruction-following.

Authors: We agree that controls for non-specific effects are important to substantiate the targeted nature of the identified feature sets. Our greedy filtering and factorization machine stages are intended to isolate minimal, interacting sets rather than arbitrary disruptions, but explicit baselines would strengthen the evidence. In the revised manuscript, we will add ablation experiments using equal-sized sets of randomly selected SAE latents as well as latents aligned to unrelated directions (e.g., a helpfulness or factual recall direction). We will report resulting compliance rates, refusal rates, and metrics for output quality or instruction-following to demonstrate specificity to refusal circuitry. revision: yes
Referee: [Results on redundant features] The reported finding of redundant features that remain dormant unless earlier features are suppressed lacks quantitative details such as activation statistics, compliance rate changes, or examples from the two studied models, weakening assessment of its implications for the interaction model.

Authors: We thank the referee for highlighting the need for more detail on the redundant features. The manuscript reports evidence of such features activating only after primary ones are suppressed, supporting the interaction model, but additional quantification will aid evaluation. In the revision, we will expand this section with activation statistics (pre- and post-suppression means across prompts), compliance rate deltas for individual versus joint ablations, and concrete examples from both Gemma-2-2B-IT and LLaMA-3.1-8B-IT, including sample harmful prompts and model outputs illustrating the redundancy effect. revision: yes
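The pre- and post-suppression statistics promised here can be sketched as follows, assuming per-prompt latent activations are recorded once on the unmodified model and once with the primary features ablated. The thresholds `eps` and `min_gain` are arbitrary illustrative values, not figures from the paper.

```python
import numpy as np

def dormant_features(acts_before, acts_after, primary_idx,
                     eps=1e-6, min_gain=0.1):
    """Latents that are silent before suppression and active after it.

    acts_before, acts_after: arrays of shape (n_prompts, n_latents).
    Returns the candidate indices and the per-latent mean activation gain.
    """
    mean_before = acts_before.mean(axis=0)
    mean_after = acts_after.mean(axis=0)
    gain = mean_after - mean_before
    mask = (mean_before <= eps) & (gain >= min_gain)
    mask[np.asarray(list(primary_idx), dtype=int)] = False  # skip ablated ones
    return np.flatnonzero(mask), gain

# Toy activations: latent 0 is primary; latent 4 wakes up once 0 is ablated.
acts_before = np.zeros((4, 6))
acts_before[:, 0] = 1.0
acts_before[:, 2] = 0.5
acts_after = acts_before.copy()
acts_after[:, 0] = 0.0
acts_after[:, 4] = 0.8

idx, gain = dormant_features(acts_before, acts_after, primary_idx=[0])
```

Pairing these activation gains with compliance rates for individual versus joint ablations would show whether the newly active latents actually restore refusal or merely co-occur with the suppression.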

Circularity Check

0 steps flagged

No significant circularity; empirical search pipeline is self-contained

The paper presents an empirical pipeline consisting of refusal-direction identification in SAE latent space, greedy pruning to a minimal feature set, and fitting a factorization machine to model interactions. No load-bearing step reduces by construction to its own inputs, nor does any claimed causal result or jailbreak effect equate to a fitted parameter or self-citation chain. The central demonstration relies on experimental ablations whose outcomes are measured against model behavior, remaining falsifiable and independent of the fitting procedure itself. This qualifies as a normal non-circular empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Abstract-only review; the work rests on standard mechanistic-interpretability assumptions about SAE latents and the validity of activation ablation as a causal intervention. No explicit free parameters or invented entities are named.

axioms (2)

domain assumption: SAE latents capture causally relevant directions in residual-stream activations.
Invoked by the decision to search and ablate within the SAE feature space rather than raw activations.

domain assumption: Ablation of a small set of latents can isolate refusal behavior without global side effects.
Required for the claim that the discovered sets demonstrate causal influence on refusal.

pith-pipeline@v0.9.0 · 5760 in / 1371 out tokens · 43656 ms · 2026-05-18T18:52:13.747935+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: "Our search proceeds in three stages: (1) Refusal Direction: find a refusal-mediating direction and collect SAE features near that direction; (2) Greedy Filtering: prune to a minimal set; and (3) Interaction Discovery: fit a factorization machine (FM) that captures nonlinear interactions"

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · embed_injective · unclear

Paper passage: "we also find evidence of redundant features which remain dormant unless earlier features are suppressed"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Prompt-R1: Collaborative Automatic Prompting Framework via End-to-end Reinforcement Learning
cs.CL 2025-11 unverdicted novelty 6.0

Prompt-R1 is an end-to-end RL framework where a small-scale LLM collaborates with large-scale LLMs by generating prompts, using a dual-constrained reward to optimize correctness and quality, and outperforms baselines ...

Reference graph

Works this paper leans on

38 extracted references · 38 canonical work pages · cited by 1 Pith paper · 9 internal anchors

[3] Anil, C.; Durmus, E.; Panickssery, N.; Sharma, M.; Benton, J.; Kundu, S.; Batson, J.; Tong, M.; Mu, J.; Ford, D.; et al. 2024. Many-shot jailbreaking. Advances in Neural Information Processing Systems, 37: 129696–129742.

[4] Arditi, A.; Obeso, O.; Syed, A.; Paleka, D.; Panickssery, N.; Gurnee, W.; and Nanda, N. 2024. Refusal in language models is mediated by a single direction. arXiv preprint arXiv:2406.11717.

[5] Bai, Y.; Jones, A.; Ndousse, K.; Askell, A.; Chen, A.; DasSarma, N.; Drain, D.; Fort, S.; Ganguli, D.; Henighan, T.; et al. 2022a. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862.

[6] Bai, Y.; Kadavath, S.; Kundu, S.; Askell, A.; Kernion, J.; Jones, A.; Chen, A.; Goldie, A.; Mirhoseini, A.; McKinnon, C.; et al. 2022b. Constitutional ai: Harmlessness from ai feedback. arXiv preprint arXiv:2212.08073.

[7] Belinkov, Y. 2022. Probing classifiers: Promises, shortcomings, and advances. Computational Linguistics, 48(1): 207–219.

[8] Bien, J.; Taylor, J.; and Tibshirani, R. 2013. A lasso for hierarchical interactions. Annals of statistics, 41(3): 1111.

[9] Brahman, F.; Kumar, S.; Balachandran, V.; Dasigi, P.; Pyatkin, V.; Ravichander, A.; Wiegreffe, S.; Dziri, N.; Chandu, K.; Hessel, J.; et al. 2024. The art of saying no: Contextual noncompliance in language models. Advances in Neural Information Processing Systems, 37: 49706–49748.

[10] Bricken, T.; Templeton, A.; Batson, J.; Chen, B.; Jermyn, A.; Conerly, T.; Turner, N.; Anil, C.; Denison, C.; Askell, A.; Lasenby, R.; Wu, Y.; Kravec, S.; Schiefer, N.; Maxwell, T.; Joseph, N.; Hatfield-Dodds, Z.; Tamkin, A.; Nguyen, K.; McLean, B.; Burke, J. E.; Hume, T.; Carter, S.; Henighan, T.; and Olah, C. 2023. Towards Monosemanticity: Decomposing L...

[11] Chen, J.; Wang, X.; Yao, Z.; Bai, Y.; Hou, L.; and Li, J. 2024. Finding safety neurons in large language models. arXiv preprint arXiv:2406.14144.

[12] Chughtai, B.; Cooney, A.; and Nanda, N. 2024. Summing up the facts: Additive mechanisms behind factual recall in llms. arXiv preprint arXiv:2402.07321.

[13] Cunningham, H.; Ewart, A.; Riggs, L.; Huben, R.; and Sharkey, L. 2023. Sparse autoencoders find highly interpretable features in language models. arXiv preprint arXiv:2309.08600.

[14] Dutta, S.; Singh, J.; Chakrabarti, S.; and Chakraborty, T. 2024. How to think step-by-step: A mechanistic understanding of chain-of-thought reasoning. arXiv preprint arXiv:2402.18312.

[15] Elhage, N.; Nanda, N.; Olsson, C.; Henighan, T.; Joseph, N.; Mann, B.; Askell, A.; Bai, Y.; Chen, A.; Conerly, T.; DasSarma, N.; Drain, D.; Ganguli, D.; Hatfield-Dodds, Z.; Hernandez, D.; Jones, A.; Kernion, J.; Lovitt, L.; Ndousse, K.; Amodei, D.; Brown, T.; Clark, J.; Kaplan, J.; McCandlish, S.; and Olah, C. 2021. A Mathematical Framework for Transforme...

[16] Ganguli, D.; Askell, A.; Schiefer, N.; Liao, T. I.; Lukošiūtė, K.; Chen, A.; Goldie, A.; Mirhoseini, A.; Olsson, C.; Hernandez, D.; et al. 2023. The capacity for moral self-correction in large language models. arXiv preprint arXiv:2302.07459.

[17] Jain, S.; Lubana, E. S.; Oksuz, K.; Joy, T.; Torr, P.; Sanyal, A.; and Dokania, P. 2024. What makes and breaks safety fine-tuning? a mechanistic study. Advances in Neural Information Processing Systems, 37: 93406–93478.

[18] Lindsey, J.; Gurnee, W.; Ameisen, E.; Chen, B.; Pearce, A.; Turner, N. L.; Citro, C.; Abrahams, D.; Carter, S.; Hosmer, B.; Marcus, J.; Sklar, M.; Templeton, A.; Bricken, T.; McDougall, C.; Cunningham, H.; Henighan, T.; Jermyn, A.; Jones, A.; Persic, A.; Qi, Z.; Thompson, T. B.; Zimmerman, S.; Rivoire, K.; Conerly, T.; Olah, C.; and Batson, J. 2025. On th...

[19] Mazeika, M.; Phan, L.; Yin, X.; Zou, A.; Wang, Z.; Mu, N.; Sakhaee, E.; Li, N.; Basart, S.; Li, B.; et al. 2024. HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal. arXiv preprint arXiv:2402.04249.

[20] McGrath, T.; Rahtz, M.; Kramar, J.; Mikulik, V.; and Legg, S. 2023. The hydra effect: Emergent self-repair in language model computations. arXiv preprint arXiv:2307.15771.

[21] O'Brien, K.; Majercak, D.; Fernandes, X.; Edgar, R.; Chen, J.; Nori, H.; Carignan, D.; Horvitz, E.; and Poursabzi-Sangde, F. 2024. Steering language model refusal with sparse autoencoders. arXiv preprint arXiv:2411.11296.

[22] Olah, C.; Cammarata, N.; Schubert, L.; Goh, G.; Petrov, M.; and Carter, S. 2020. Zoom In: An Introduction to Circuits. Distill. https://distill.pub/2020/circuits/zoom-in

[23] Ouyang, L.; Wu, J.; Jiang, X.; Almeida, D.; Wainwright, C.; Mishkin, P.; Zhang, C.; Agarwal, S.; Slama, K.; Ray, A.; et al. 2022. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35: 27730–27744.

[24] Paulo, G.; and Belrose, N. 2025. Sparse Autoencoders Trained on the Same Data Learn Different Features. arXiv preprint arXiv:2501.16615.

[25] Perez, E.; Huang, S.; Song, F.; Cai, T.; Ring, R.; Aslanides, J.; Glaese, A.; McAleese, N.; and Irving, G. 2022. Red teaming language models with language models. arXiv preprint arXiv:2202.03286.

[26] Rafailov, R.; Sharma, A.; Mitchell, E.; Manning, C. D.; Ermon, S.; and Finn, C. 2023. Direct preference optimization: Your language model is secretly a reward model. Advances in neural information processing systems, 36: 53728–53741.

[27] Rendle, S. 2010. Factorization machines. In 2010 IEEE International conference on data mining, 995–1000. IEEE.

[28] Röttger, P.; Kirk, H. R.; Vidgen, B.; Attanasio, G.; Bianchi, F.; and Hovy, D. 2023. XSTest: A test suite for identifying exaggerated safety behaviours in large language models. arXiv preprint arXiv:2308.01263.

[29] Sharkey, L.; Braun, D.; and Millidge, B. 2022. [Interim research report] Taking features out of superposition with sparse autoencoders. https://www.alignmentforum.org/posts/z6QQJbtpkEAX3Aojj/interim-research-report-taking-features-out-of-superposition. Accessed: 2025-01-29.

[30] Shen, G.; Zhao, D.; Feng, L.; He, X.; Wang, J.; Shen, S.; Tong, H.; Dong, Y.; Li, J.; Zheng, X.; et al. 2025. PANDAGUARD: Systematic Evaluation of LLM Safety against Jailbreaking Attacks. arXiv preprint arXiv:2505.13862.

[31] Skalse, J.; Howe, N.; Krasheninnikov, D.; and Krueger, D. 2022. Defining and characterizing reward gaming. Advances in Neural Information Processing Systems, 35: 9460–9471.

[32] Tan, S.; Hooker, G.; Koch, P.; Gordo, A.; and Caruana, R. 2023. Considerations when learning additive explanations for black-box models. Machine Learning, 112(9): 3333–3359.

[33] Team, L. 2024. Meta Llama Guard 2. https://github.com/meta-llama/PurpleLlama/blob/main/Llama-Guard2/MODEL_CARD.md

[34] Templeton, A.; Conerly, T.; Marcus, J.; Lindsey, J.; Bricken, T.; Chen, B.; Pearce, A.; Citro, C.; Ameisen, E.; Jones, A.; Cunningham, H.; Turner, N. L.; McDougall, C.; MacDiarmid, M.; Freeman, C. D.; Sumers, T. R.; Rees, E.; Batson, J.; Jermyn, A.; Carter, S.; Olah, C.; and Henighan, T. 2024. Scaling Monosemanticity: Extracting Interpretable Features fro...

[35] Wang, K.; Variengien, A.; Conmy, A.; Shlegeris, B.; and Steinhardt, J. 2022. Interpretability in the wild: a circuit for indirect object identification in gpt-2 small. arXiv preprint arXiv:2211.00593.

[36] Wei, A.; Haghtalab, N.; and Steinhardt, J. 2023. Jailbroken: How does llm safety training fail? Advances in Neural Information Processing Systems, 36: 80079–80110.

[37] Xie, T.; Qi, X.; Zeng, Y.; Huang, Y.; Sehwag, U. M.; Huang, K.; He, L.; Wei, B.; Li, D.; Sheng, Y.; et al. 2024. Sorry-bench: Systematically evaluating large language model safety refusal. arXiv preprint arXiv:2406.14598.

[38] Zou, A.; Wang, Z.; Carlini, N.; Nasr, M.; Kolter, J. Z.; and Fredrikson, M. 2023. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043.

[1] [1]

, " * write output.state after.block = add.period write newline

ENTRY address archivePrefix author booktitle chapter edition editor eid eprint howpublished institution isbn journal key month note number organization pages publisher school series title type volume year label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block FUNCTION init.state.consts #0 'before.a...

work page

[2] [2]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...

work page

[3] [3]

Anil, C.; Durmus, E.; Panickssery, N.; Sharma, M.; Benton, J.; Kundu, S.; Batson, J.; Tong, M.; Mu, J.; Ford, D.; et al. 2024. Many-shot jailbreaking. Advances in Neural Information Processing Systems, 37: 129696--129742

work page 2024

[4] [4]

Arditi, A.; Obeso, O.; Syed, A.; Paleka, D.; Panickssery, N.; Gurnee, W.; and Nanda, N. 2024. Refusal in language models is mediated by a single direction. arXiv preprint arXiv:2406.11717

work page internal anchor Pith review Pith/arXiv arXiv 2024

[5] [5]

Bai, Y.; Jones, A.; Ndousse, K.; Askell, A.; Chen, A.; DasSarma, N.; Drain, D.; Fort, S.; Ganguli, D.; Henighan, T.; et al. 2022 a . Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862

work page internal anchor Pith review Pith/arXiv arXiv 2022

[6] [6]

Bai, Y.; Kadavath, S.; Kundu, S.; Askell, A.; Kernion, J.; Jones, A.; Chen, A.; Goldie, A.; Mirhoseini, A.; McKinnon, C.; et al. 2022 b . Constitutional ai: Harmlessness from ai feedback. arXiv preprint arXiv:2212.08073

work page internal anchor Pith review Pith/arXiv arXiv 2022

[7] [7]

Belinkov, Y. 2022. Probing classifiers: Promises, shortcomings, and advances. Computational Linguistics, 48(1): 207--219

work page 2022

[8] [8]

Bien, J.; Taylor, J.; and Tibshirani, R. 2013. A lasso for hierarchical interactions. Annals of statistics, 41(3): 1111

work page 2013

[9] [9]

Brahman, F.; Kumar, S.; Balachandran, V.; Dasigi, P.; Pyatkin, V.; Ravichander, A.; Wiegreffe, S.; Dziri, N.; Chandu, K.; Hessel, J.; et al. 2024. The art of saying no: Contextual noncompliance in language models. Advances in Neural Information Processing Systems, 37: 49706--49748

work page 2024

[10] [10]

E.; Hume, T.; Carter, S.; Henighan, T.; and Olah, C

Bricken, T.; Templeton, A.; Batson, J.; Chen, B.; Jermyn, A.; Conerly, T.; Turner, N.; Anil, C.; Denison, C.; Askell, A.; Lasenby, R.; Wu, Y.; Kravec, S.; Schiefer, N.; Maxwell, T.; Joseph, N.; Hatfield-Dodds, Z.; Tamkin, A.; Nguyen, K.; McLean, B.; Burke, J. E.; Hume, T.; Carter, S.; Henighan, T.; and Olah, C. 2023. Towards Monosemanticity: Decomposing L...

work page 2023

[11] [11]

Chen, J.; Wang, X.; Yao, Z.; Bai, Y.; Hou, L.; and Li, J. 2024. Finding safety neurons in large language models. arXiv preprint arXiv:2406.14144

work page arXiv 2024

[12] [12]

Chughtai, B.; Cooney, A.; and Nanda, N. 2024. Summing up the facts: Additive mechanisms behind factual recall in llms. arXiv preprint arXiv:2402.07321

work page arXiv 2024

[13] [13]

Cunningham, H.; Ewart, A.; Riggs, L.; Huben, R.; and Sharkey, L. 2023. Sparse autoencoders find highly interpretable features in language models. arXiv preprint arXiv:2309.08600

work page internal anchor Pith review Pith/arXiv arXiv 2023

[14] [14]

Dutta, S.; Singh, J.; Chakrabarti, S.; and Chakraborty, T. 2024. How to think step-by-step: A mechanistic understanding of chain-of-thought reasoning. arXiv preprint arXiv:2402.18312

work page arXiv 2024

[15] [15]

Elhage, N.; Nanda, N.; Olsson, C.; Henighan, T.; Joseph, N.; Mann, B.; Askell, A.; Bai, Y.; Chen, A.; Conerly, T.; DasSarma, N.; Drain, D.; Ganguli, D.; Hatfield-Dodds, Z.; Hernandez, D.; Jones, A.; Kernion, J.; Lovitt, L.; Ndousse, K.; Amodei, D.; Brown, T.; Clark, J.; Kaplan, J.; McCandlish, S.; and Olah, C. 2021. A Mathematical Framework for Transforme...

work page 2021

[16] [16]

Ganguli, A

Ganguli, D.; Askell, A.; Schiefer, N.; Liao, T. I.; Luko s i \=u t \.e , K.; Chen, A.; Goldie, A.; Mirhoseini, A.; Olsson, C.; Hernandez, D.; et al. 2023. The capacity for moral self-correction in large language models. arXiv preprint arXiv:2302.07459

work page arXiv 2023

[17] [17]

S.; Oksuz, K.; Joy, T.; Torr, P.; Sanyal, A.; and Dokania, P

Jain, S.; Lubana, E. S.; Oksuz, K.; Joy, T.; Torr, P.; Sanyal, A.; and Dokania, P. 2024. What makes and breaks safety fine-tuning? a mechanistic study. Advances in Neural Information Processing Systems, 37: 93406--93478

work page 2024

[18] [18]

Lindsey, J.; Gurnee, W.; Ameisen, E.; Chen, B.; Pearce, A.; Turner, N. L.; Citro, C.; Abrahams, D.; Carter, S.; Hosmer, B.; Marcus, J.; Sklar, M.; Templeton, A.; Bricken, T.; McDougall, C.; Cunningham, H.; Henighan, T.; Jermyn, A.; Jones, A.; Persic, A.; Qi, Z.; Thompson, T. B.; Zimmerman, S.; Rivoire, K.; Conerly, T.; Olah, C.; and Batson, J. 2025. On th...

work page 2025

[19] [19]

Mazeika, M.; Phan, L.; Yin, X.; Zou, A.; Wang, Z.; Mu, N.; Sakhaee, E.; Li, N.; Basart, S.; Li, B.; et al. 2024. HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal. arXiv preprint arXiv:2402.04249

work page internal anchor Pith review Pith/arXiv arXiv 2024

[20] [20]

McGrath, T.; Rahtz, M.; Kramar, J.; Mikulik, V.; and Legg, S. 2023. The hydra effect: Emergent self-repair in language model computations. arXiv preprint arXiv:2307.15771

work page arXiv 2023

[21] [21]

O'Brien, K.; Majercak, D.; Fernandes, X.; Edgar, R.; Chen, J.; Nori, H.; Carignan, D.; Horvitz, E.; and Poursabzi-Sangde, F. 2024. Steering language model refusal with sparse autoencoders. arXiv preprint arXiv:2411.11296

work page arXiv 2024

[22] [22]

Olah, C.; Cammarata, N.; Schubert, L.; Goh, G.; Petrov, M.; and Carter, S. 2020. Zoom In: An Introduction to Circuits. Distill. Https://distill.pub/2020/circuits/zoom-in

[23]

Ouyang, L.; Wu, J.; Jiang, X.; Almeida, D.; Wainwright, C.; Mishkin, P.; Zhang, C.; Agarwal, S.; Slama, K.; Ray, A.; et al. 2022. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35: 27730--27744

[24]

Paulo, G.; and Belrose, N. 2025. Sparse Autoencoders Trained on the Same Data Learn Different Features. arXiv preprint arXiv:2501.16615

[25]

Perez, E.; Huang, S.; Song, F.; Cai, T.; Ring, R.; Aslanides, J.; Glaese, A.; McAleese, N.; and Irving, G. 2022. Red teaming language models with language models. arXiv preprint arXiv:2202.03286

[26]

Rafailov, R.; Sharma, A.; Mitchell, E.; Manning, C. D.; Ermon, S.; and Finn, C. 2023. Direct preference optimization: Your language model is secretly a reward model. Advances in neural information processing systems, 36: 53728--53741

[27]

Rendle, S. 2010. Factorization machines. In 2010 IEEE International conference on data mining, 995--1000. IEEE

[28]

Röttger, P.; Kirk, H. R.; Vidgen, B.; Attanasio, G.; Bianchi, F.; and Hovy, D. 2023. XSTest: A test suite for identifying exaggerated safety behaviours in large language models. arXiv preprint arXiv:2308.01263

[29]

Sharkey, L.; Braun, D.; and Millidge, B. 2022. [Interim research report] Taking features out of superposition with sparse autoencoders. https://www.alignmentforum.org/posts/z6QQJbtpkEAX3Aojj/interim-research-report-taking-features-out-of-superposition. Accessed: 2025-01-29

[30]

Shen, G.; Zhao, D.; Feng, L.; He, X.; Wang, J.; Shen, S.; Tong, H.; Dong, Y.; Li, J.; Zheng, X.; et al. 2025. PANDAGUARD: Systematic Evaluation of LLM Safety against Jailbreaking Attacks. arXiv preprint arXiv:2505.13862

[31]

Skalse, J.; Howe, N.; Krasheninnikov, D.; and Krueger, D. 2022. Defining and characterizing reward gaming. Advances in Neural Information Processing Systems, 35: 9460--9471

[32]

Tan, S.; Hooker, G.; Koch, P.; Gordo, A.; and Caruana, R. 2023. Considerations when learning additive explanations for black-box models. Machine Learning, 112(9): 3333--3359

[33]

Team, L. 2024. Meta Llama Guard 2. https://github.com/meta-llama/PurpleLlama/blob/main/Llama-Guard2/MODEL_CARD.md

[34]

Templeton, A.; Conerly, T.; Marcus, J.; Lindsey, J.; Bricken, T.; Chen, B.; Pearce, A.; Citro, C.; Ameisen, E.; Jones, A.; Cunningham, H.; Turner, N. L.; McDougall, C.; MacDiarmid, M.; Freeman, C. D.; Sumers, T. R.; Rees, E.; Batson, J.; Jermyn, A.; Carter, S.; Olah, C.; and Henighan, T. 2024. Scaling Monosemanticity: Extracting Interpretable Features fro...

[35]

Wang, K.; Variengien, A.; Conmy, A.; Shlegeris, B.; and Steinhardt, J. 2022. Interpretability in the wild: a circuit for indirect object identification in GPT-2 small. arXiv preprint arXiv:2211.00593

[36]

Wei, A.; Haghtalab, N.; and Steinhardt, J. 2023. Jailbroken: How does LLM safety training fail? Advances in Neural Information Processing Systems, 36: 80079--80110

[37]

Xie, T.; Qi, X.; Zeng, Y.; Huang, Y.; Sehwag, U. M.; Huang, K.; He, L.; Wei, B.; Li, D.; Sheng, Y.; et al. 2024. Sorry-bench: Systematically evaluating large language model safety refusal. arXiv preprint arXiv:2406.14598

[38]

Zou, A.; Wang, Z.; Carlini, N.; Nasr, M.; Kolter, J. Z.; and Fredrikson, M. 2023. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043
