and Su, Pei-Hao and Vandyke, David and Wen, Tsung-Hsien and Young, Steve

Thomson, Blaise, Rojas-Barahona, Lina M · 2016 · DOI 10.18653/v1/n16-1018

1 Pith paper cites this work.

representative citing papers

Adversarial Robustness of Activation Steering in Large Language Models

cs.LG · 2026-06-05 · unverdicted · novelty 7.0 · ref 59

First systematic test shows activation steering robustness drops sharply (up to 64%) under adversarial input perturbations across multiple extraction methods, models, and personas.
