Refusal in language models is mediated by a single direction. Advances in Neural Information Processing Systems, 37:136037–136083

3 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1

citation-polarity summary

fields

cs.LG 2 · cs.CR 1

years

2026 3

roles

background 1

polarities

background 1

representative citing papers

Semantic Denial of Service in LLM-controlled robots

cs.CR · 2026-04-25 · unverdicted · novelty 6.0

Injecting brief safety-plausible phrases into robot audio triggers LLM safety halts, enabling semantic denial-of-service attacks. Prompt-based defenses suppress the attack only at the cost of impaired detection of genuine hazards.
