SRA achieves 99.71% average attack success across 26 LLMs by optimizing for coherent malicious semantics via the SRHS algorithm, with claimed theoretical guarantees on convergence and transfer.
Semantic representation attack against aligned large language models
1 Pith paper cites this work. Polarity classification is still indexing.
Citation-role summary: extension (1)
Fields: cs.CL (1)
Years: 2026 (1)
Verdicts: UNVERDICTED (1)
Roles: extension (1)
Polarities: extend (1)
Representative citing papers
LLM-Agnostic Semantic Representation Attack
SRA achieves 99.71% average attack success across 26 LLMs by optimizing for coherent malicious semantics via the SRHS algorithm, with claimed theoretical guarantees on convergence and transfer.