Refusal in language models is mediated by a single direction.Advances in Neural Information Processing Systems, 37:136037–136083

Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, Neel Nanda · 2024

3 Pith papers cite this work. Polarity classification is still indexing.

3 Pith papers citing it

browse 3 citing papers

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Preference Instability in Reward Models: Detection and Mitigation via Sparse Autoencoders

cs.LG · 2026-05-07 · conditional · novelty 6.0

Sparse autoencoders isolate unstable features in reward model representations and enable two mitigation techniques that reduce preference errors on perturbed inputs without retraining.

Semantic Denial of Service in LLM-controlled robots

cs.CR · 2026-04-25 · unverdicted · novelty 6.0

Injecting brief safety-plausible phrases into robot audio triggers LLM safety halts, enabling semantic denial-of-service attacks where prompt defenses trade attack suppression for impaired genuine hazard detection.

Towards Effective Theory of LLMs: A Representation Learning Approach

cs.LG · 2026-05-10 · unverdicted · novelty 5.0

RET learns temporally consistent macrovariables from LLM activations via self-supervised learning to support interpretability, early behavioral prediction, and causal intervention.

citing papers explorer

Showing 3 of 3 citing papers.

Preference Instability in Reward Models: Detection and Mitigation via Sparse Autoencoders cs.LG · 2026-05-07 · conditional · none · ref 1
Sparse autoencoders isolate unstable features in reward model representations and enable two mitigation techniques that reduce preference errors on perturbed inputs without retraining.
Semantic Denial of Service in LLM-controlled robots cs.CR · 2026-04-25 · unverdicted · none · ref 24
Injecting brief safety-plausible phrases into robot audio triggers LLM safety halts, enabling semantic denial-of-service attacks where prompt defenses trade attack suppression for impaired genuine hazard detection.
Towards Effective Theory of LLMs: A Representation Learning Approach cs.LG · 2026-05-10 · unverdicted · none · ref 4
RET learns temporally consistent macrovariables from LLM activations via self-supervised learning to support interpretability, early behavioral prediction, and causal intervention.

Refusal in language models is mediated by a single direction.Advances in Neural Information Processing Systems, 37:136037–136083

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer