STEER is a gradient-guided attack that iteratively translates refusal-triggering words into low-resource languages to jailbreak LLMs, reaching 93-96.7% success on open models and 35.5% transfer to GPT-4o-mini.
Sandwich attack: Multi-language mixture adaptive attack on LLMs.arXiv preprint arXiv:2404.07242,
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
fields
cs.AI 1years
2026 1verdicts
UNVERDICTED 1representative citing papers
citing papers explorer
-
Safety Targeted Embedding Exploit via Refinement
STEER is a gradient-guided attack that iteratively translates refusal-triggering words into low-resource languages to jailbreak LLMs, reaching 93-96.7% success on open models and 35.5% transfer to GPT-4o-mini.